1 Introduction

1.1 Dataset used

The data used in this report come from: DrivenData. (2015). Pump it Up: Data Mining the Water Table. Retrieved [10 01 2023] from https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table.

The dataset contains several parameters that characterize the operation of water pumps. The overall goal is to predict the condition of each pump, distinguishing which are functional, which need some repair, and which do not work at all.

The goal is to predict the operating condition of a waterpoint for each record in the dataset.

1.2 Objective

  1. Through an exploratory analysis of the dataset, define the input and target variables needed to correctly predict the operating condition of the water pumps.

1.3 Packages used

The following packages are used for the analysis:

#-----------------------------------------
# Library loading
rm(list = ls())

suppressPackageStartupMessages({
  library(data.table)
  library(dplyr)
  library(caret)
  library(scales)
  library(ggplot2)
  library(stringi)
  library(stringr)
  library(dataPreparation)
  library(knitr)
  library(kableExtra)
  library(ggpubr)
  library(tictoc)
  library(ggeasy)
  library(lubridate)
  library(inspectdf)
  library(questionr)   # freq
  library(gbm)
})

2 - Iteration A

This section presents an initial analysis of the dataset and a first model, with the aim of later improving both the dataset, through basic feature engineering, and the models, through tuning. Note that this project was not submitted in time to enter the competition.

2.1 Loading the dataset:

#-----------------------------------------
# Data Loading

datIn_x    <- fread( file = 'Training_set_values.csv', nThread = 2)
datIn_y    <- fread( file = 'Training_set_labels.csv', nThread = 2)
predict_x  <- fread( file = 'Test_set_values.csv', nThread = 2)

2.2 Initial exploratory data analysis

A quick check for duplicates is performed; any duplicates found would be removed.

cat("Number of duplicates in the predictors:", sum(duplicated(datIn_x)),'\n')
## Number of duplicates in the predictors: 0
cat("Number of duplicates in the labels:", sum(duplicated(datIn_y)),'\n')
## Number of duplicates in the labels: 0

The number of variables in the problem:

cat("The number of fields is:",ncol(datIn_x),'\n')
## The number of fields is: 40

2.3 - Cleaning the loaded data

The data are cleaned and the classes of the input set are examined. The id columns are removed, since they carry no predictive information.

datIn_x$id <- NULL
datIn_y$id <- NULL

# Before deleting, save the ids to be predicted
test_id <- predict_x$id

predict_x$id <- NULL
datIn <- cbind(datIn_x,datIn_y)

2.4 Exploratory data analysis of the database [EDA]

We verify that the datasets contain the same number of columns as stated in the problem description. The structure of the database is as follows:

str(datIn)                 # structure of the R object
## Classes 'data.table' and 'data.frame':   59400 obs. of  40 variables:
##  $ amount_tsh           : num  6000 0 25 0 0 20 0 0 0 0 ...
##  $ date_recorded        : IDate, format: "2011-03-14" "2013-03-06" ...
##  $ funder               : chr  "Roman" "Grumeti" "Lottery Club" "Unicef" ...
##  $ gps_height           : int  1390 1399 686 263 0 0 0 0 0 0 ...
##  $ installer            : chr  "Roman" "GRUMETI" "World vision" "UNICEF" ...
##  $ longitude            : num  34.9 34.7 37.5 38.5 31.1 ...
##  $ latitude             : num  -9.86 -2.15 -3.82 -11.16 -1.83 ...
##  $ wpt_name             : chr  "none" "Zahanati" "Kwa Mahundi" "Zahanati Ya Nanyumbu" ...
##  $ num_private          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ basin                : chr  "Lake Nyasa" "Lake Victoria" "Pangani" "Ruvuma / Southern Coast" ...
##  $ subvillage           : chr  "Mnyusi B" "Nyamara" "Majengo" "Mahakamani" ...
##  $ region               : chr  "Iringa" "Mara" "Manyara" "Mtwara" ...
##  $ region_code          : int  11 20 21 90 18 4 17 17 14 18 ...
##  $ district_code        : int  5 2 4 63 1 8 3 3 6 1 ...
##  $ lga                  : chr  "Ludewa" "Serengeti" "Simanjiro" "Nanyumbu" ...
##  $ ward                 : chr  "Mundindi" "Natta" "Ngorika" "Nanyumbu" ...
##  $ population           : int  109 280 250 58 0 1 0 0 0 0 ...
##  $ public_meeting       : logi  TRUE NA TRUE TRUE TRUE TRUE ...
##  $ recorded_by          : chr  "GeoData Consultants Ltd" "GeoData Consultants Ltd" "GeoData Consultants Ltd" "GeoData Consultants Ltd" ...
##  $ scheme_management    : chr  "VWC" "Other" "VWC" "VWC" ...
##  $ scheme_name          : chr  "Roman" "" "Nyumba ya mungu pipe scheme" "" ...
##  $ permit               : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
##  $ construction_year    : int  1999 2010 2009 1986 0 2009 0 0 0 0 ...
##  $ extraction_type      : chr  "gravity" "gravity" "gravity" "submersible" ...
##  $ extraction_type_group: chr  "gravity" "gravity" "gravity" "submersible" ...
##  $ extraction_type_class: chr  "gravity" "gravity" "gravity" "submersible" ...
##  $ management           : chr  "vwc" "wug" "vwc" "vwc" ...
##  $ management_group     : chr  "user-group" "user-group" "user-group" "user-group" ...
##  $ payment              : chr  "pay annually" "never pay" "pay per bucket" "never pay" ...
##  $ payment_type         : chr  "annually" "never pay" "per bucket" "never pay" ...
##  $ water_quality        : chr  "soft" "soft" "soft" "soft" ...
##  $ quality_group        : chr  "good" "good" "good" "good" ...
##  $ quantity             : chr  "enough" "insufficient" "enough" "dry" ...
##  $ quantity_group       : chr  "enough" "insufficient" "enough" "dry" ...
##  $ source               : chr  "spring" "rainwater harvesting" "dam" "machine dbh" ...
##  $ source_type          : chr  "spring" "rainwater harvesting" "dam" "borehole" ...
##  $ source_class         : chr  "groundwater" "surface" "surface" "groundwater" ...
##  $ waterpoint_type      : chr  "communal standpipe" "communal standpipe" "communal standpipe multiple" "communal standpipe multiple" ...
##  $ waterpoint_type_group: chr  "communal standpipe" "communal standpipe" "communal standpipe" "communal standpipe" ...
##  $ status_group         : chr  "functional" "functional" "functional" "non functional" ...
##  - attr(*, ".internal.selfref")=<externalptr>

From the structure we can see:

  • The variable date_recorded has been correctly assigned the Date type, although it appears to have too many distinct values and will need to be transformed.
  • Three variables have been incorrectly assigned the int type.
  • Two variables have been interpreted as logical; they will be converted to character.
  • status_group is of type character.

Therefore a copy of the dataset, called datMod, is created. The changes will be applied to datMod (train) and predict_x (test). The original variables from which transformed ones are derived are removed, to reduce the risk of highly correlated/associated variables and thereby of overfitting. The target variable (status_group) is converted to a factor.

datMod <- copy(datIn)

datMod$region_code <- as.character(datMod$region_code)
datMod$district_code <- as.character(datMod$district_code)
datMod$construction_year <- as.character(datMod$construction_year)
datMod$permit <- as.character(as.integer(datMod$permit))
datMod$public_meeting <- as.character(as.integer(datMod$public_meeting))

predict_x$region_code <- as.character(predict_x$region_code)
predict_x$district_code <- as.character(predict_x$district_code)
predict_x$construction_year <- as.character(predict_x$construction_year)
predict_x$permit <- as.character(as.integer(predict_x$permit))
predict_x$public_meeting <- as.character(as.integer(predict_x$public_meeting))


datMod$status_group <-  as.factor(datMod$status_group)
str(datMod)
## Classes 'data.table' and 'data.frame':   59400 obs. of  40 variables:
##  $ amount_tsh           : num  6000 0 25 0 0 20 0 0 0 0 ...
##  $ date_recorded        : IDate, format: "2011-03-14" "2013-03-06" ...
##  $ funder               : chr  "Roman" "Grumeti" "Lottery Club" "Unicef" ...
##  $ gps_height           : int  1390 1399 686 263 0 0 0 0 0 0 ...
##  $ installer            : chr  "Roman" "GRUMETI" "World vision" "UNICEF" ...
##  $ longitude            : num  34.9 34.7 37.5 38.5 31.1 ...
##  $ latitude             : num  -9.86 -2.15 -3.82 -11.16 -1.83 ...
##  $ wpt_name             : chr  "none" "Zahanati" "Kwa Mahundi" "Zahanati Ya Nanyumbu" ...
##  $ num_private          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ basin                : chr  "Lake Nyasa" "Lake Victoria" "Pangani" "Ruvuma / Southern Coast" ...
##  $ subvillage           : chr  "Mnyusi B" "Nyamara" "Majengo" "Mahakamani" ...
##  $ region               : chr  "Iringa" "Mara" "Manyara" "Mtwara" ...
##  $ region_code          : chr  "11" "20" "21" "90" ...
##  $ district_code        : chr  "5" "2" "4" "63" ...
##  $ lga                  : chr  "Ludewa" "Serengeti" "Simanjiro" "Nanyumbu" ...
##  $ ward                 : chr  "Mundindi" "Natta" "Ngorika" "Nanyumbu" ...
##  $ population           : int  109 280 250 58 0 1 0 0 0 0 ...
##  $ public_meeting       : chr  "1" NA "1" "1" ...
##  $ recorded_by          : chr  "GeoData Consultants Ltd" "GeoData Consultants Ltd" "GeoData Consultants Ltd" "GeoData Consultants Ltd" ...
##  $ scheme_management    : chr  "VWC" "Other" "VWC" "VWC" ...
##  $ scheme_name          : chr  "Roman" "" "Nyumba ya mungu pipe scheme" "" ...
##  $ permit               : chr  "0" "1" "1" "1" ...
##  $ construction_year    : chr  "1999" "2010" "2009" "1986" ...
##  $ extraction_type      : chr  "gravity" "gravity" "gravity" "submersible" ...
##  $ extraction_type_group: chr  "gravity" "gravity" "gravity" "submersible" ...
##  $ extraction_type_class: chr  "gravity" "gravity" "gravity" "submersible" ...
##  $ management           : chr  "vwc" "wug" "vwc" "vwc" ...
##  $ management_group     : chr  "user-group" "user-group" "user-group" "user-group" ...
##  $ payment              : chr  "pay annually" "never pay" "pay per bucket" "never pay" ...
##  $ payment_type         : chr  "annually" "never pay" "per bucket" "never pay" ...
##  $ water_quality        : chr  "soft" "soft" "soft" "soft" ...
##  $ quality_group        : chr  "good" "good" "good" "good" ...
##  $ quantity             : chr  "enough" "insufficient" "enough" "dry" ...
##  $ quantity_group       : chr  "enough" "insufficient" "enough" "dry" ...
##  $ source               : chr  "spring" "rainwater harvesting" "dam" "machine dbh" ...
##  $ source_type          : chr  "spring" "rainwater harvesting" "dam" "borehole" ...
##  $ source_class         : chr  "groundwater" "surface" "surface" "groundwater" ...
##  $ waterpoint_type      : chr  "communal standpipe" "communal standpipe" "communal standpipe multiple" "communal standpipe multiple" ...
##  $ waterpoint_type_group: chr  "communal standpipe" "communal standpipe" "communal standpipe" "communal standpipe" ...
##  $ status_group         : Factor w/ 3 levels "functional","functional needs repair",..: 1 1 1 3 1 1 3 3 3 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

2.4.1 Descriptive statistics (continuous variables) / frequency tables (categorical variables) for the database.

Frequencies of the categorical variables

# categorical plot
x <- inspect_cat(datMod)
show_plot(x)

Correlations

# correlations in numeric columns
x <- inspect_cor(datMod)
show_plot(x)

# feature imbalance bar plot
x <- inspect_imb(datMod)
show_plot(x)

# memory usage barplot
x <- inspect_mem(datMod)
show_plot(x)

# missingness barplot
x <- inspect_na(datMod)
show_plot(x)

# histograms for numeric columns
x <- inspect_num(datMod)
show_plot(x)

# barplot of column types
x <- inspect_types(datMod)
show_plot(x)

From the exploratory plots we can conclude:

  • Categorical variables predominate. Tree-based models may therefore be a useful approach, as they can achieve high predictive accuracy in this setting.
  • No pair of quantitative variables has an absolute correlation above 0.95, which reduces the risk of overfitting. At first glance, there appear to be no out-of-range values.
  • Some variables have very unbalanced or very small category frequencies, which can lead to overfitting and high computational cost.
  • The variables funder, installer, lga, scheme_name, subvillage, ward and wpt_name have many categories (>20).
  • The variable date_recorded has been correctly assigned the Date type, although it appears to have too many distinct values and will need to be transformed.
  • The variables permit and public_meeting have missing values in more than 5% of the records. These missing values will be recoded into a 'Desconocido' (unknown) class.

datMod$fe_permit<-car::recode(datMod$permit,"NA='Desconocido'")
datMod$fe_public_meeting<-car::recode(datMod$public_meeting,"NA='Desconocido'")
datMod$permit <- NULL
datMod$public_meeting <- NULL


predict_x$fe_permit<-car::recode(predict_x$permit,"NA='Desconocido'")
predict_x$fe_public_meeting<-car::recode(predict_x$public_meeting,"NA='Desconocido'")
predict_x$permit <- NULL
predict_x$public_meeting <- NULL
  • The variable recorded_by is flagged for removal, since it carries no relevant information for the problem. The variable num_private is also removed, since the vast majority (>98%) of its values are zero.
freq(datMod$num_private)  
freq(datMod$recorded_by)
datMod$num_private <- NULL
datMod$recorded_by <- NULL
predict_x$num_private <- NULL
predict_x$recorded_by <- NULL
  • We check whether amount_tsh and population have fewer than 10 distinct values; if so, they would be converted to categorical. Their value frequencies are analysed.

  • Variable amount_tsh: more than 70% of its values are 0, which would indicate that those waterpoints are dry. This is assumed to be real information rather than missing data, so the variable is kept.

  • Variable population: the area around 36% of the waterpoints is uninhabited. In line with the minimal-transformation policy for this first iteration, the variable is left unchanged.

  • It is confirmed that longitude has more than 10 distinct values, something the EDA histogram had left in doubt.

freq(datMod$amount_tsh, sort = 'dec')
freq(datMod$population, sort = 'dec')
freq(datMod$longitude, sort = 'dec')
freq(datMod$construction_year, sort = 'dec')
  • The variable construction_year contains 0 values, which indicate wrong or missing data. They are treated as missing (frequency 34.9%) and grouped into a 'Desconocido' category.
datMod$fe_construction_year<-car::recode(datMod$construction_year,"0='Desconocido'")
datMod$construction_year <- NULL
freq(datMod$fe_construction_year, sort = 'dec')
predict_x$fe_construction_year<-car::recode(predict_x$construction_year,"0='Desconocido'")
predict_x$construction_year <- NULL
  • It will be necessary to use stratified k-folds to account for the class imbalance in the variable status_group.
freq(datMod$status_group)

2.5 - Initial feature engineering

The variable date_recorded has a multitude of distinct values that (1) increase computation time and (2) raise the risk of overfitting. New variables will be derived from it and the original will then be removed.

datMod$fe_anio    <- year(datMod$date_recorded)
datMod$fe_mes     <- month(datMod$date_recorded)
datMod$fe_dianum  <- day(datMod$date_recorded)
datMod$fe_diasem  <- wday(datMod$date_recorded, label = TRUE, abbr = TRUE)
datMod$date_recorded <- NULL

predict_x$fe_anio    <- year(predict_x$date_recorded)
predict_x$fe_mes     <- month(predict_x$date_recorded)
predict_x$fe_dianum  <- day(predict_x$date_recorded)
predict_x$fe_diasem  <- wday(predict_x$date_recorded, label = TRUE, abbr = TRUE)
predict_x$date_recorded <- NULL

Other categorical variables also have many categories, as can be seen in the EDA and in the following table. A lumping transformation will be applied to those with the highest number of categories.

freq.input.cat <- sapply(Filter(is.character, datMod),function(x) length(unique(x))) 

freq.input.cat <- sort(freq.input.cat, decreasing = TRUE)

kable(freq.input.cat)
variable                    x
wpt_name                37400
subvillage              19288
scheme_name              2697
installer                2146
ward                     2092
funder                   1898
lga                       125
fe_construction_year       55
region_code                27
region                     21
district_code              20
extraction_type            18
scheme_management          13
extraction_type_group      13
management                 12
source                     10
basin                       9
water_quality               8
extraction_type_class       7
payment                     7
payment_type                7
source_type                 7
waterpoint_type             7
quality_group               6
waterpoint_type_group       6
management_group            5
quantity                    5
quantity_group              5
source_class                3
fe_permit                   3
fe_public_meeting           3

Lumping is applied to the first 6 variables, which have thousands of categories. For these 6 variables, categories with less than 5% frequency are grouped together; if no category exceeds 5%, the variable is dropped. Naturally, every transformation applied to the train set must also be applied to the test set.
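The 5% rule described above can be sketched as a small base-R helper (a hypothetical function for illustration; the report itself implements the lumping variable by variable with car::recode):

```r
# Hypothetical helper: lump every category whose relative frequency is
# below `threshold` into a single `other` level.
lump_rare <- function(x, threshold = 0.05, other = "Otros") {
  freqs <- table(x) / length(x)              # relative frequency per category
  rare  <- names(freqs)[freqs < threshold]   # categories under the threshold
  x[x %in% rare] <- other
  x
}

# Toy usage: only "a" reaches 5%, so "b" and "c" are lumped into "Otros".
v <- c(rep("a", 95), rep("b", 3), rep("c", 2))
table(lump_rare(v))
```

If no category survived the threshold, `table(lump_rare(v))` would collapse to a single `Otros` level, which is the signal used here to drop the variable entirely.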

The variable wpt_name is reduced to two categories, 'named' and 'none'. Both datasets show very similar frequencies, suggesting (though not proving) that they come from the same sample.

freq(datMod$wpt_name, sort = 'dec')
# categories other than 'none'
wpt.name.labels <- unique(datMod$wpt_name)
wpt.name.labels <- wpt.name.labels[!(wpt.name.labels %in% 'none')]

datMod$fe_wpt_name <-car::recode(datMod$wpt_name, "wpt.name.labels = 'named'")
datMod$wpt_name <- NULL

# rebuild the labels because the test set contains categories not seen in train
wpt.name.labels <- unique(predict_x$wpt_name)
wpt.name.labels <- wpt.name.labels[!(wpt.name.labels %in% 'none')]
predict_x$fe_wpt_name <-car::recode(predict_x$wpt_name, "wpt.name.labels = 'named'")
predict_x$wpt_name <- NULL

freq(datMod$fe_wpt_name, sort = 'dec')
freq(predict_x$fe_wpt_name, sort = 'dec')

The variable subvillage is dropped, since none of its categories reaches a 5% frequency.

freq(datMod$subvillage, sort = 'dec')
datMod$subvillage <- NULL
predict_x$subvillage <- NULL

The large number of missing values in scheme_name had been overlooked. Since they do not exceed 50%, they are recoded as 'Desconocido' and the remaining categories are grouped as 'Conocido'. Again, the training and test sets show similar frequencies after recoding.

freq(datMod$scheme_name, sort = 'dec')
scheme_name.labels <- unique(datMod$scheme_name)
scheme_name.labels <- scheme_name.labels[!(scheme_name.labels %in% "")]
datMod$fe_scheme_name <-car::recode(datMod$scheme_name, "scheme_name.labels = 'Conocido'")
datMod$fe_scheme_name <-car::recode(datMod$fe_scheme_name, "'' = 'Desconocido'")
datMod$scheme_name <- NULL

scheme_name.labels <- unique(predict_x$scheme_name)
scheme_name.labels <- scheme_name.labels[!(scheme_name.labels %in% "")]
predict_x$fe_scheme_name <-car::recode(predict_x$scheme_name, "scheme_name.labels = 'Conocido'")
predict_x$fe_scheme_name <-car::recode(predict_x$fe_scheme_name, "'' = 'Desconocido'")
predict_x$scheme_name <- NULL


freq(datMod$fe_scheme_name, sort = 'dec')

In the variable installer, rare categories are grouped into 'Otros', and the missing values, previously undetected since they exceed 5%, are recoded.

freq(datMod$installer, sort = 'dec')
installer.labels <- unique(datMod$installer)
installer.labels <- installer.labels[!(installer.labels %in% c('','DWE'))]
datMod$fe_installer <-car::recode(datMod$installer, "installer.labels = 'Otros'")
datMod$fe_installer <-car::recode(datMod$fe_installer, "'' = 'Desconocido'")
datMod$installer <- NULL
freq(datMod$fe_installer, sort = 'dec')
installer.labels <- unique(predict_x$installer)
installer.labels <- installer.labels[!(installer.labels %in% c('','DWE'))]
predict_x$fe_installer <-car::recode(predict_x$installer, "installer.labels = 'Otros'")
predict_x$fe_installer <-car::recode(predict_x$fe_installer, "'' = 'Desconocido'")
predict_x$installer <- NULL
freq(predict_x$fe_installer, sort = 'dec')

The variable ward is dropped, as it is redundant with the variable lga, which also encodes geographic location.

freq(datMod$ward, sort = 'dec')
datMod$ward <- NULL
predict_x$ward <- NULL

The variable funder is transformed.

freq(datMod$funder, sort = 'dec')
funder.labels <- unique(datMod$funder)
funder.labels <- funder.labels[!(funder.labels %in% c('','Government Of Tanzania','Danida'))]
datMod$fe_funder <-car::recode(datMod$funder, "funder.labels = 'Otros'")
datMod$fe_funder <-car::recode(datMod$fe_funder, "'' = 'Desconocido'")
freq(datMod$fe_funder, sort = 'dec')
datMod$funder <- NULL


funder.labels <- unique(predict_x$funder)
funder.labels <- funder.labels[!(funder.labels %in% c('','Government Of Tanzania','Danida'))]
predict_x$fe_funder <-car::recode(predict_x$funder, "funder.labels = 'Otros'")
predict_x$fe_funder <-car::recode(predict_x$fe_funder, "'' = 'Desconocido'")
freq(predict_x$fe_funder, sort = 'dec')
predict_x$funder <- NULL

2.6 - Models

Due to time constraints, only two algorithms are analysed: ranger (random forest) and gbm (boosting). Models are built from the inputs modified in this first iteration, datMod, and evaluated on the equally transformed test set predict_x.

datMod is split into a train set and a validation set, with proportions of 80% and 20% respectively. Models are trained on train and evaluated on validation. The models with the highest validation accuracy will then be tried on the test set.

Splitting the train set in two has the disadvantage of training on a smaller population, but it provides a completely independent sample for evaluation before uploading predictions to the website, where only 3 submissions per day are allowed. Without that restriction, no validation set would have been created. Nevertheless, given the large number of samples provided, removing 20% should not cause a strong penalty in accuracy.

Due to deadline and computational constraints, two resampling techniques are evaluated: none ('none', a single fold) and stratified k-folds. The 'none' option serves to estimate the required computation time, which determines how many folds and repetitions are affordable. On the other hand, since there are variables with many categories, stratification could also reduce overfitting in this particular problem. No grid search over hyperparameters is performed.
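To illustrate what stratified folds mean here, the assignment can be sketched in base R (an assumed, simplified equivalent of caret's fold utilities; the function name is invented for the example):

```r
# Assign each observation to one of k folds, shuffling within each class so
# that every fold keeps roughly the same class proportions as the full set.
stratified_folds <- function(y, k = 5, seed = 2023) {
  set.seed(seed)
  folds <- integer(length(y))
  for (cl in unique(y)) {
    idx <- sample(which(y == cl))            # shuffle the class's rows
    folds[idx] <- rep_len(seq_len(k), length(idx))
  }
  folds
}

# Toy usage with the three status_group classes, imbalanced on purpose
y <- rep(c("functional", "non functional", "functional needs repair"),
         times = c(300, 200, 50))
f <- stratified_folds(y, k = 5)
table(y, f)   # each class is spread evenly over the 5 folds
```

With a plain (non-stratified) split, the minority class 'functional needs repair' could end up under-represented in some folds; stratification guarantees it appears in all of them.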

The caret function createDataPartition is used for the stratified k-folds resampling (5 folds) with shuffling.

In this first iteration no repetitions are performed with any model.

We start with ranger, using parameter values that are either defaults or 'common' choices, as used in code by lecturers of the Data Science master's programme the author is enrolled in. No parametric grid search is performed for any algorithm in this first iteration.

For the sake of consistency, parameter values shared by the algorithms are kept equal. Obviously this consistency is not perfect, since the algorithms differ. For example, 100 trees are used both for ranger (num.trees) and for gbm (n.trees).

#------------------------------- TRAIN -TEST - SPLIT 

library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(ranger)
library(tictoc)
library(caret)
library(ggplot2)
library(lattice)
# registerDoMC(2)

# Split out validation dataset
# create a list of 80% of the rows in the original dataset we can use for training
set.seed(2023)
validationIndex <- createDataPartition(datMod$status_group, p=0.80, list=FALSE)

# select 20% of the data for validation
my_test  <- datMod[-validationIndex,]
# use the remaining 80% of the data for training
my_train <- datMod[validationIndex,]
tic()
set.seed(2023)
fit_r1 <- ranger(
               status_group ~. , 
               data = my_train,
               num.trees = 100,
               importance = 'impurity',
               write.forest = TRUE,
               min.node.size = 1,
               splitrule = "gini",
               verbose = TRUE,
               classification = TRUE
             )

toc()
## 14.464 sec elapsed

The execution time is around 15 seconds, which gives an idea of how exhaustive we can afford to be in future evaluations with (stratified) CV and grid search.

# display results
#print(fit)
fit_r1
## Ranger result
## 
## Call:
##  ranger(status_group ~ ., data = my_train, num.trees = 100, importance = "impurity",      write.forest = TRUE, min.node.size = 1, splitrule = "gini",      verbose = TRUE, classification = TRUE) 
## 
## Type:                             Classification 
## Number of trees:                  100 
## Sample size:                      47522 
## Number of independent variables:  38 
## Mtry:                             6 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             18.93 %
summary(fit_r1)
##                           Length Class         Mode     
## predictions               47522  factor        numeric  
## num.trees                     1  -none-        numeric  
## num.independent.variables     1  -none-        numeric  
## mtry                          1  -none-        numeric  
## min.node.size                 1  -none-        numeric  
## variable.importance          38  -none-        numeric  
## prediction.error              1  -none-        numeric  
## forest                       10  ranger.forest list     
## confusion.matrix              9  table         numeric  
## splitrule                     1  -none-        character
## treetype                      1  -none-        character
## call                         10  -none-        call     
## importance.mode               1  -none-        character
## num.samples                   1  -none-        numeric  
## replace                       1  -none-        logical

Variable importance

vars_imp <- fit_r1$variable.importance
vars_imp <- as.data.frame(vars_imp)
vars_imp$myvar <- rownames(vars_imp)


library(ggpubr) 
ggbarplot(vars_imp[1:10,],
          x = "myvar", y = "vars_imp",
          #fill  = 'myvar',
          color = "blue",             # bar border color
          palette = "jco",            # jco journal color palett. see ?ggpar
          sort.val = "asc",           # Sort values in ascending order
          sort.by.groups = FALSE,     # Don't sort inside each group
          x.text.angle = 90,          # Rotate vertically x axis texts
          ylab = "Importancia",
          xlab = 'Variable', 
          #legend.title = "MPG Group",
          rotate = TRUE,
          ggtheme = theme_minimal(),
          main = "ranger con dataMod"
          )

Validation on the held-out validation set.

validation.fit_r1 <- predict(fit_r1, data = my_test)
confusionMatrix( my_test$status_group, validation.fit_r1$predictions)
## Confusion Matrix and Statistics
## 
##                          Reference
## Prediction                functional functional needs repair non functional
##   functional                    5790                     136            525
##   functional needs repair        458                     277            128
##   non functional                 954                      87           3523
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8074          
##                  95% CI : (0.8002, 0.8144)
##     No Information Rate : 0.6063          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6383          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: functional Class: functional needs repair
## Sensitivity                     0.8039                        0.55400
## Specificity                     0.8586                        0.94850
## Pos Pred Value                  0.8975                        0.32097
## Neg Pred Value                  0.7398                        0.97975
## Prevalence                      0.6063                        0.04209
## Detection Rate                  0.4875                        0.02332
## Detection Prevalence            0.5431                        0.07266
## Balanced Accuracy               0.8313                        0.75125
##                      Class: non functional
## Sensitivity                         0.8436
## Specificity                         0.8648
## Pos Pred Value                      0.7719
## Neg Pred Value                      0.9107
## Prevalence                          0.3516
## Detection Rate                      0.2966
## Detection Prevalence                0.3842
## Balanced Accuracy                   0.8542

After this first iteration, an accuracy of 80% (0.8074) has been obtained; the OOB prediction error is 18.93%. The prediction is saved to a csv file to be submitted on the competition platform:

prediction.fit_r1 <- predict(fit_r1, data = predict_x)
file.prediction.fit_r1 <- cbind.data.frame(test_id, prediction.fit_r1$predictions)
colnames(file.prediction.fit_r1) <- c("id","status_group")
fwrite(file.prediction.fit_r1, file = "predi_fit_r1.csv", sep = ",")

3 - Iteration B

The first iteration consisted of running models with the fewest possible transformations, to avoid excessive overfitting and computation time. In this second iteration, three extra transformations are applied to the set of input variables:

  • Convert some date-derived variables to categorical.
  • Treat outliers in the numeric variables.
  • Apply lumping to categorical variables that still have under-represented categories.

Less aggressive limits are applied than in the first iteration, given that the number of categories was already reduced in iteration A and the number of observations is fairly large. The limit is 2%, both for recategorizing and for flagging a value as an outlier.
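One common way to apply such a 2% outlier limit to a numeric variable is quantile capping (winsorizing). The helper below is a sketch under that assumption, not code from the original analysis:

```r
# Cap values below the p-th quantile or above the (1-p)-th quantile.
cap_outliers <- function(x, p = 0.02) {
  lo <- quantile(x, p,     na.rm = TRUE, names = FALSE)
  hi <- quantile(x, 1 - p, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, lo), hi)
}

# Toy usage: two extreme population-like values get pulled back to the caps.
pop <- c(rep(100, 96), 0, 0, 100000, 250000)
range(cap_outliers(pop))
```

Capping, rather than dropping, keeps the number of observations intact, which matters here since the same rows must stay aligned between the predictors and the labels.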

First, some date-derived variables are converted to categorical, since they take discrete values. The suffix '2' added when renaming these variables indicates the iteration in which they were transformed.

The train/test sets derived from this second iteration will be called datMod2 and predict_x2 respectively.

datMod2 <- copy(datMod)
predict_x2 <- copy(predict_x)

datMod2$fe_anio2 <- as.character(datMod2$fe_anio)     
datMod2$fe_mes2 <- as.character(datMod2$fe_mes)         
datMod2$fe_dianum2 <- as.character(datMod2$fe_dianum)  
datMod2$fe_anio  <- NULL       
datMod2$fe_mes  <- NULL           
datMod2$fe_dianum  <- NULL 
 
predict_x2$fe_anio2 <- as.character(predict_x2$fe_anio)     
predict_x2$fe_mes2 <- as.character(predict_x2$fe_mes)         
predict_x2$fe_dianum2 <- as.character(predict_x2$fe_dianum)  
predict_x2$fe_anio  <- NULL       
predict_x2$fe_mes  <- NULL           
predict_x2$fe_dianum  <- NULL 


 
str(datMod2) 
## Classes 'data.table' and 'data.frame':   59400 obs. of  39 variables:
##  $ amount_tsh           : num  6000 0 25 0 0 20 0 0 0 0 ...
##  $ gps_height           : int  1390 1399 686 263 0 0 0 0 0 0 ...
##  $ longitude            : num  34.9 34.7 37.5 38.5 31.1 ...
##  $ latitude             : num  -9.86 -2.15 -3.82 -11.16 -1.83 ...
##  $ basin                : chr  "Lake Nyasa" "Lake Victoria" "Pangani" "Ruvuma / Southern Coast" ...
##  $ region               : chr  "Iringa" "Mara" "Manyara" "Mtwara" ...
##  $ region_code          : chr  "11" "20" "21" "90" ...
##  $ district_code        : chr  "5" "2" "4" "63" ...
##  $ lga                  : chr  "Ludewa" "Serengeti" "Simanjiro" "Nanyumbu" ...
##  $ population           : int  109 280 250 58 0 1 0 0 0 0 ...
##  $ scheme_management    : chr  "VWC" "Other" "VWC" "VWC" ...
##  $ extraction_type      : chr  "gravity" "gravity" "gravity" "submersible" ...
##  $ extraction_type_group: chr  "gravity" "gravity" "gravity" "submersible" ...
##  $ extraction_type_class: chr  "gravity" "gravity" "gravity" "submersible" ...
##  $ management           : chr  "vwc" "wug" "vwc" "vwc" ...
##  $ management_group     : chr  "user-group" "user-group" "user-group" "user-group" ...
##  $ payment              : chr  "pay annually" "never pay" "pay per bucket" "never pay" ...
##  $ payment_type         : chr  "annually" "never pay" "per bucket" "never pay" ...
##  $ water_quality        : chr  "soft" "soft" "soft" "soft" ...
##  $ quality_group        : chr  "good" "good" "good" "good" ...
##  $ quantity             : chr  "enough" "insufficient" "enough" "dry" ...
##  $ quantity_group       : chr  "enough" "insufficient" "enough" "dry" ...
##  $ source               : chr  "spring" "rainwater harvesting" "dam" "machine dbh" ...
##  $ source_type          : chr  "spring" "rainwater harvesting" "dam" "borehole" ...
##  $ source_class         : chr  "groundwater" "surface" "surface" "groundwater" ...
##  $ waterpoint_type      : chr  "communal standpipe" "communal standpipe" "communal standpipe multiple" "communal standpipe multiple" ...
##  $ waterpoint_type_group: chr  "communal standpipe" "communal standpipe" "communal standpipe" "communal standpipe" ...
##  $ status_group         : Factor w/ 3 levels "functional","functional needs repair",..: 1 1 1 3 1 1 3 3 3 1 ...
##  $ fe_permit            : chr  "0" "1" "1" "1" ...
##  $ fe_public_meeting    : chr  "1" "Desconocido" "1" "1" ...
##  $ fe_construction_year : chr  "1999" "2010" "2009" "1986" ...
##  $ fe_diasem            : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 2 4 2 2 4 1 2 3 7 4 ...
##  $ fe_wpt_name          : chr  "none" "named" "named" "named" ...
##  $ fe_scheme_name       : chr  "Conocido" "Desconocido" "Conocido" "Desconocido" ...
##  $ fe_installer         : chr  "Otros" "Otros" "Otros" "Otros" ...
##  $ fe_funder            : chr  "Otros" "Otros" "Otros" "Otros" ...
##  $ fe_anio2             : chr  "2011" "2013" "2013" "2013" ...
##  $ fe_mes2              : chr  "3" "3" "2" "1" ...
##  $ fe_dianum2           : chr  "14" "6" "25" "28" ...
##  - attr(*, ".internal.selfref")=<externalptr>

3.1 Descriptive statistics

# categorical plot
x <- inspect_cat(datMod2) 
show_plot(x)

# correlations in numeric columns
x <- inspect_cor(datMod2)
show_plot(x)

# feature imbalance bar plot
x <- inspect_imb(datMod2)
show_plot(x)

# memory usage barplot
x <- inspect_mem(datMod2)
show_plot(x)

# missingness barplot
x <- inspect_na(datMod2)
show_plot(x)

# histograms for numeric columns
x <- inspect_num(datMod2)
show_plot(x)

# barplot of column types
x <- inspect_types(datMod2)
show_plot(x)

3.2 Numeric variables

Judging by the median and mean on the train set, the variables amount_tsh and population show some very large values. We check whether these are outliers or simply 'large data', based on their share of the train set.

The same check is extended, with lower risk of outliers, to the remaining numeric variables.

print("amount_tsh")
## [1] "amount_tsh"
summary(datMod2$amount_tsh)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      0.0      0.0    317.7     20.0 350000.0
print("population")
## [1] "population"
summary(datMod2$population)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0    25.0   179.9   215.0 30500.0
print("gps_height")
## [1] "gps_height"
summary(datMod2$gps_height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -90.0     0.0   369.0   668.3  1319.2  2770.0
print("longitude")
## [1] "longitude"
summary(datMod2$longitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   33.09   34.91   34.08   37.18   40.35
print("latitude")
## [1] "latitude"
summary(datMod2$latitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -11.649  -8.541  -5.022  -5.706  -3.326   0.000

Boxplots standardized by the maximum value, to help visualize the proportion of potential outliers.

amount_tsh <- boxplot.stats(datMod2$amount_tsh)
100*length(amount_tsh$out)/nrow(datMod2)
## [1] 18.78956
population <- boxplot.stats(datMod2$population)
100*length(population$out)/nrow(datMod2)
## [1] 7.378788
gps_height <- boxplot.stats(datMod2$gps_height)
100*length(gps_height$out)/nrow(datMod2)
## [1] 0
boxplot(datMod2$amount_tsh/max(datMod2$amount_tsh), main="boxplot amount_tsh")

boxplot(datMod2$population/max(datMod2$population), main="boxplot population")

boxplot(datMod2$gps_height/max(datMod2$gps_height), main="boxplot gps_height")

The variables amount_tsh and population have a large share of big values, above 5%, so they cannot all simply be dropped. Still, the standardized boxplots show a few extremely large observations that could hurt the models.

We therefore evaluate how the boxplots look after setting values above the 98th percentile to NA and imputing them with the train medians. These boxplots are normalized by the maximum values of the set produced by iteration 1, datMod.

As mentioned earlier, the bounds and medians computed on datMod2 are applied to predict_x2 for the imputation.

upper.bound.amount_tsh <- quantile(datMod2$amount_tsh, 0.98)
median.amount_tsh <- median(datMod2$amount_tsh, na.rm = TRUE)
datMod2$fe_amount_tsh2 <- copy(datMod2$amount_tsh)
datMod2$fe_amount_tsh2[datMod2$fe_amount_tsh2>upper.bound.amount_tsh] <- NA
datMod2$fe_amount_tsh2[is.na(datMod2$fe_amount_tsh2)] <- median.amount_tsh
boxplot(datMod2$fe_amount_tsh2/max(datMod2$amount_tsh),main="boxplot fe_amount_tsh2")

datMod2$amount_tsh <- NULL

upper.bound.population <- quantile(datMod2$population, 0.98)
median.population <- median(datMod2$population, na.rm = TRUE)
datMod2$fe_population2 <- copy(datMod2$population)
datMod2$fe_population2[datMod2$fe_population2>upper.bound.population] <- NA
datMod2$fe_population2[is.na(datMod2$fe_population2)] <- median.population
boxplot(datMod2$fe_population2/max(datMod2$population),main="boxplot fe_population2")

datMod2$population <- NULL

predict_x2$fe_amount_tsh2 <- copy(predict_x2$amount_tsh)
predict_x2$fe_amount_tsh2[predict_x2$fe_amount_tsh2>upper.bound.amount_tsh] <- NA
predict_x2$fe_amount_tsh2[is.na(predict_x2$fe_amount_tsh2)] <- median.amount_tsh
boxplot(predict_x2$fe_amount_tsh2/max(predict_x2$amount_tsh),main="boxplot fe_amount_tsh2")

predict_x2$amount_tsh <- NULL

# Reuse the bound and median computed on the train set (datMod2)
predict_x2$fe_population2 <- copy(predict_x2$population)
predict_x2$fe_population2[predict_x2$fe_population2>upper.bound.population] <- NA
predict_x2$fe_population2[is.na(predict_x2$fe_population2)] <- median.population
boxplot(predict_x2$fe_population2/max(predict_x2$population),main="boxplot fe_population2")

predict_x2$population <- NULL

Comparing the y-axis scale with the earlier boxplots, the data are now visibly more compact.
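The four capping blocks above repeat the same pattern; a hypothetical helper (a sketch, not code from the report) could compute the bounds once on the train set and reuse them on the test set:

```r
# Cap values above the train p-quantile and impute with the train median.
# When transforming the test set, pass the 'upper' and 'med' returned for train.
cap_and_impute <- function(x, upper = NULL, med = NULL, p = 0.98) {
  if (is.null(upper)) upper <- unname(quantile(x, p, na.rm = TRUE))
  if (is.null(med))   med   <- median(x, na.rm = TRUE)
  x[x > upper] <- NA
  x[is.na(x)] <- med
  list(x = x, upper = upper, med = med)
}

train <- cap_and_impute(c(0, 0, 20, 25, 100000))   # toy train column
test  <- cap_and_impute(c(5, 999999),
                        upper = train$upper, med = train$med)
```

Wrapping the pattern in a function makes it harder to accidentally recompute quantiles on the test data, since the train statistics must be passed in explicitly.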

3.3 Categorical variables

Before applying the next transformations, datMod2 with the outlier treatment already done is saved into a new variable (datMod2_num) for iteration 3. This allows the effect of removing outliers to be assessed independently.

Next, lumping is applied: for each variable, categories with a frequency below 2% are grouped together. The day-of-month variable, fe_dianum2, is excluded from this process.

datMod2_num <- copy(datMod2)
predict_x2_num <- copy(predict_x2)
status_index <- grep("status_group", names(datMod2) , fixed = TRUE )
# Move status_group to the last column; this keeps predict_x2 and datMod2
# with the same column positions, which the lines below assume
datMod2 <- cbind(datMod2[,-26], datMod2[,26])
char.indices <- which(sapply(datMod2, is.character))
char.indices <- char.indices[names(char.indices) != "fe_dianum2"]
indices.changed <- c()

for (char.index in char.indices)
{
  a <- as.data.frame( freq( datMod2[[char.index]], sort = "inc") )
  rare <- rownames(a)[ a[,3] < 2 ]   # categories below the 2% threshold

  if (length(rare) > 0)
  {
    indices.changed <- append(indices.changed, char.index)
    datMod2[[char.index]][ datMod2[[char.index]] %in% rare ] <- 'Otros'
    predict_x2[[char.index]][ predict_x2[[char.index]] %in% rare ] <- 'Otros'
  }
}


library(stringr) # already loaded at the start

for (index.changed in indices.changed)
{
  name <- colnames(datMod2)[index.changed]
  # prefix 'fe_' to mark engineered features
  if (substr(name, 1, 3) != "fe_")
  {
    name <- paste("fe_", name, sep = "")
  }
  # suffix '2' marks this iteration
  if (str_sub(name, start = -1) != "2")
  {
    name <- paste(name, "2", sep = "")
  }
  colnames(datMod2)[index.changed] <- name
  colnames(predict_x2)[index.changed] <- name
}

A quick check of train and test: both sets contain the same number of columns and similar proportions. As expected, the plot below shows fewer categories than in iteration 1, i.e. the data are less 'segmented'.

colnames(datMod2)
##  [1] "gps_height"            "longitude"             "latitude"             
##  [4] "basin"                 "region"                "region_code"          
##  [7] "district_code"         "lga"                   "scheme_management"    
## [10] "extraction_type"       "extraction_type_group" "extraction_type_class"
## [13] "management"            "management_group"      "payment"              
## [16] "payment_type"          "water_quality"         "quality_group"        
## [19] "quantity"              "quantity_group"        "source"               
## [22] "source_type"           "source_class"          "waterpoint_type"      
## [25] "waterpoint_type_group" "fe_permit"             "fe_public_meeting"    
## [28] "fe_construction_year"  "fe_diasem"             "fe_wpt_name"          
## [31] "fe_scheme_name"        "fe_installer"          "fe_funder"            
## [34] "fe_anio2"              "fe_mes2"               "fe_dianum2"           
## [37] "fe_amount_tsh2"        "fe_population2"        "status_group"
colnames(predict_x2)
##  [1] "gps_height"            "longitude"             "latitude"             
##  [4] "basin"                 "region"                "region_code"          
##  [7] "district_code"         "lga"                   "scheme_management"    
## [10] "extraction_type"       "extraction_type_group" "extraction_type_class"
## [13] "management"            "management_group"      "payment"              
## [16] "payment_type"          "water_quality"         "quality_group"        
## [19] "quantity"              "quantity_group"        "source"               
## [22] "source_type"           "source_class"          "waterpoint_type"      
## [25] "waterpoint_type_group" "fe_permit"             "fe_public_meeting"    
## [28] "fe_construction_year"  "fe_diasem"             "fe_wpt_name"          
## [31] "fe_scheme_name"        "fe_installer"          "fe_funder"            
## [34] "fe_anio2"              "fe_mes2"               "fe_dianum2"           
## [37] "fe_amount_tsh2"        "fe_population2"
x <- inspect_cat(datMod2) 
show_plot(x)

An outlier was found in the lga field: it still shows too many categories, so values with a frequency below 0.2% are grouped into an 'Otros' category.

freq(datMod2$lga ,   sort = "inc")
freq(predict_x2$lga, sort = "inc")
# Under-represented lga categories (frequency below 0.2%)
rare.lga <- c('Lindi Urban','Nyamagana','Arusha Urban','Kigoma Urban','Moshi Urban','Songea Urban','Bukoba Urban')

datMod2$fe_lga <- datMod2$lga
datMod2$fe_lga[datMod2$fe_lga %in% rare.lga] <- 'Otros'

predict_x2$fe_lga <- predict_x2$lga
predict_x2$fe_lga[predict_x2$fe_lga %in% rare.lga] <- 'Otros'

freq(datMod2$fe_lga, sort = 'dec')
datMod2$lga <- NULL
predict_x2$lga <- NULL

3.4 - Models

Other models are added, keeping ranger with the same hyperparameters as in iteration A. The resulting accuracies are:
- Ranger without resampling: 0.8097
- Ranger with stratified k-folds: 0.7617
- GBM without resampling: 1 <- Something is wrong here; a likely cause is target leakage, since after lga is replaced by fe_lga the target status_group is no longer the last column, so the positional range 1:38 passed to gbm.fit can include it. For lack of time this has not been resolved.
- GBM with stratified k-folds: 0.6806

There is a small improvement in the Ranger model without resampling, which scored 0.8074 in iteration A.

3.4.1 Ranger without resampling

# Split out validation dataset
# create a list of 80% of the rows in the original dataset we can use for training
set.seed(2023)
validationIndex <- createDataPartition(datMod2$status_group, p=0.80, list=FALSE)

# select 20% of the data for validation
my_test2  <- datMod2[-validationIndex,]
# use the remaining 80% of the data for training
my_train2 <- datMod2[validationIndex,]


tic()
set.seed(1)
fit_r2 <- ranger(
               status_group ~. , 
               data = my_train2,
               num.trees = fit_r1$num.trees,
               importance = 'impurity',
               write.forest = TRUE,
               min.node.size = fit_r1$min.node.size,
               splitrule = "gini",
               verbose = TRUE,
               classification = TRUE
             )

toc()
## 14.126 sec elapsed
fit_r2
## Ranger result
## 
## Call:
##  ranger(status_group ~ ., data = my_train2, num.trees = fit_r1$num.trees,      importance = "impurity", write.forest = TRUE, min.node.size = fit_r1$min.node.size,      splitrule = "gini", verbose = TRUE, classification = TRUE) 
## 
## Type:                             Classification 
## Number of trees:                  100 
## Sample size:                      47522 
## Number of independent variables:  38 
## Mtry:                             6 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             18.91 %
summary(fit_r2)
##                           Length Class         Mode     
## predictions               47522  factor        numeric  
## num.trees                     1  -none-        numeric  
## num.independent.variables     1  -none-        numeric  
## mtry                          1  -none-        numeric  
## min.node.size                 1  -none-        numeric  
## variable.importance          38  -none-        numeric  
## prediction.error              1  -none-        numeric  
## forest                       10  ranger.forest list     
## confusion.matrix              9  table         numeric  
## splitrule                     1  -none-        character
## treetype                      1  -none-        character
## call                         10  -none-        call     
## importance.mode               1  -none-        character
## num.samples                   1  -none-        numeric  
## replace                       1  -none-        logical
vars_imp2 <- fit_r2$variable.importance

vars_imp2 <- as.data.frame(vars_imp2)

vars_imp2$myvar <- rownames(vars_imp2)


library(ggpubr) # already loaded at the start
ggbarplot(vars_imp2[1:10,],
          x = "myvar", y = "vars_imp2",
          #fill  = 'myvar',
          color = "blue",             # bar border color
          palette = "jco",            # jco journal color palette, see ?ggpar
          sort.val = "asc",           # sort values in ascending order
          sort.by.groups = FALSE,     # don't sort inside each group
          x.text.angle = 90,          # rotate x axis texts vertically
          ylab = "Importancia",
          xlab = 'Variable', 
          #legend.title = "MPG Group",
          rotate = TRUE,
          ggtheme = theme_minimal(),
          main = "ranger con datMod2"
          )

validation.fit_r2 <- predict(fit_r2, data = my_test2)
confusionMatrix( my_test2$status_group, validation.fit_r2$predictions)
## Confusion Matrix and Statistics
## 
##                          Reference
## Prediction                functional functional needs repair non functional
##   functional                    5805                     140            506
##   functional needs repair        443                     289            131
##   non functional                 954                      86           3524
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8097          
##                  95% CI : (0.8026, 0.8168)
##     No Information Rate : 0.6063          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.643           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: functional Class: functional needs repair
## Sensitivity                     0.8060                        0.56117
## Specificity                     0.8618                        0.94949
## Pos Pred Value                  0.8999                        0.33488
## Neg Pred Value                  0.7426                        0.97948
## Prevalence                      0.6063                        0.04336
## Detection Rate                  0.4887                        0.02433
## Detection Prevalence            0.5431                        0.07266
## Balanced Accuracy               0.8339                        0.75533
##                      Class: non functional
## Sensitivity                         0.8469
## Specificity                         0.8652
## Pos Pred Value                      0.7721
## Neg Pred Value                      0.9129
## Prevalence                          0.3503
## Detection Rate                      0.2967
## Detection Prevalence                0.3842
## Balanced Accuracy                   0.8561
prediction.fit_r2 <- predict(fit_r2, data = predict_x2)
file.prediction.fit_r2 <- cbind.data.frame(test_id, prediction.fit_r2$predictions)
colnames(file.prediction.fit_r2) <- c("id","status_group")
fwrite(file.prediction.fit_r2, file = "predi_fit_r2.csv", sep = ",")

3.4.2 Ranger with stratified k-folds

set.seed(1)
my_train_l <- copy(my_train2)
my_test_l <- copy(my_test2)

# caret does not accept the original factor levels as class names
levels(my_train_l$status_group) <- c('f','fnr','nf')
levels(my_test_l$status_group) <- c('f','fnr','nf')

n.folds <- 5
cvIndex <- createFolds(factor(my_train_l$status_group), n.folds, returnTrain = T)


trainControl <- trainControl(index = cvIndex,
                             method = "cv", 
                             number = n.folds,
                             classProbs = TRUE
                            )

rf.grid<-expand.grid( mtry =  fit_r1$mtry,              
              splitrule  = 'gini',
             min.node.size = fit_r1$min.node.size
 
 )

tic()
set.seed(1)
fit_r21 <- train( 
             status_group ~., 
             data      = my_train_l, 
             method    = "ranger", 
             metric    = "Accuracy",
             importance = 'impurity',
             tuneGrid = rf.grid,
             num.trees = fit_r1$num.trees,
             trControl = trainControl,
             verbose = TRUE
            )

toc()
## 138.356 sec elapsed
validation.fit_r21 <- predict(fit_r21, newdata = my_test_l)
confusionMatrix( my_test_l$status_group, validation.fit_r21)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    f  fnr   nf
##        f   6004   19  428
##        fnr  645   81  137
##        nf  1581   20 2963
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7617         
##                  95% CI : (0.754, 0.7694)
##     No Information Rate : 0.6929         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5318         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: f Class: fnr Class: nf
## Sensitivity            0.7295   0.675000    0.8399
## Specificity            0.8775   0.933492    0.8083
## Pos Pred Value         0.9307   0.093859    0.6492
## Neg Pred Value         0.5898   0.996459    0.9228
## Prevalence             0.6929   0.010103    0.2970
## Detection Rate         0.5055   0.006819    0.2495
## Detection Prevalence   0.5431   0.072655    0.3842
## Balanced Accuracy      0.8035   0.804246    0.8241
varImp(fit_r21)
## ranger variable importance
## 
##   only 20 most important variables shown (out of 434)
## 
##                                   Overall
## waterpoint_type_groupother         100.00
## quantity_groupenough                94.46
## quantityenough                      85.29
## longitude                           84.45
## waterpoint_typeother                79.33
## latitude                            73.81
## extraction_type_groupother          71.87
## extraction_type_classother          70.01
## gps_height                          58.58
## extraction_typeother                57.79
## fe_amount_tsh2                      55.69
## fe_population2                      45.94
## payment_typenever pay               45.46
## waterpoint_typecommunal standpipe   37.96
## extraction_typegravity              33.18
## waterpoint_typehand pump            33.01
## waterpoint_type_grouphand pump      28.88
## quantityinsufficient                28.64
## managementvwc                       28.62
## extraction_type_groupgravity        25.93

3.4.3 GBM without resampling

tic()
set.seed(1)
my_train2 %>% mutate_if(is.character, as.factor) -> my_train_f
my_test2 %>% mutate_if(is.character, as.factor) -> my_test_f

# NOTE: after dropping lga and appending fe_lga (section 3.3), status_group
# sits at position 38, so this positional range likely leaks the target into
# the predictors; that would explain the perfect accuracy below
fit_g2 <- gbm.fit(my_train_f[ ,c(1:38)], my_train_f$status_group,
   distribution = "multinomial",
   n.trees = fit_r1$num.trees,
   interaction.depth = 2,
   n.minobsinnode = fit_r1$min.node.size,
   shrinkage = 0.01,
   bag.fraction = 1
)
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0986             nan     0.0100       nan
##      2        1.0688             nan     0.0100       nan
##      3        1.0402             nan     0.0100       nan
##      4        1.0125             nan     0.0100       nan
##      5        0.9859             nan     0.0100       nan
##      6        0.9601             nan     0.0100       nan
##      7        0.9353             nan     0.0100       nan
##      8        0.9113             nan     0.0100       nan
##      9        0.8880             nan     0.0100       nan
##     10        0.8656             nan     0.0100       nan
##     20        0.6758             nan     0.0100       nan
##     40        0.4255             nan     0.0100       nan
##     60        0.2746             nan     0.0100       nan
##     80        0.1797             nan     0.0100       nan
##    100        0.1186             nan     0.0100       nan
toc()
## 32.964 sec elapsed
fit_g2
## A gradient boosted model with multinomial loss function.
## 100 iterations were performed.
## There were 38 predictors of which 10 had non-zero influence.
print(fit_g2)
## A gradient boosted model with multinomial loss function.
## 100 iterations were performed.
## There were 38 predictors of which 10 had non-zero influence.
summary(fit_g2)
varimp_g2 <- kable(relative.influence(fit_g2, n.trees = 100)  )
validation.fit_g2 = predict.gbm(fit_g2,n.trees=100, newdata=my_test_f[ ,c(1:38)],type='response')
p.validation.fit_g2 <- apply(validation.fit_g2, 1, which.max)
validation.fit_g2 <- as.factor( colnames(validation.fit_g2)[p.validation.fit_g2])
confusionMatrix( validation.fit_g2,my_test_f$status_group)
## Confusion Matrix and Statistics
## 
##                          Reference
## Prediction                functional functional needs repair non functional
##   functional                    6451                       0              0
##   functional needs repair          0                     863              0
##   non functional                   0                       0           4564
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.5431     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: functional Class: functional needs repair
## Sensitivity                     1.0000                        1.00000
## Specificity                     1.0000                        1.00000
## Pos Pred Value                  1.0000                        1.00000
## Neg Pred Value                  1.0000                        1.00000
## Prevalence                      0.5431                        0.07266
## Detection Rate                  0.5431                        0.07266
## Detection Prevalence            0.5431                        0.07266
## Balanced Accuracy               1.0000                        1.00000
##                      Class: non functional
## Sensitivity                         1.0000
## Specificity                         1.0000
## Pos Pred Value                      1.0000
## Neg Pred Value                      1.0000
## Prevalence                          0.3842
## Detection Rate                      0.3842
## Detection Prevalence                0.3842
## Balanced Accuracy                   1.0000
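A plausible cause of this perfect score is target leakage through positional column selection. A minimal sketch (toy data frame, not the real dataset) of the column-shift mechanism: after lga is removed and fe_lga is appended, status_group is no longer the last column, so a fixed positional range such as 1:38 can include the target.

```r
# Toy reproduction of the column shift: the target starts out last...
df <- data.frame(a = 1, b = 2, lga = 3, status_group = 4)
df$fe_lga <- 5   # new feature is appended at the end
df$lga    <- NULL
which(colnames(df) == "status_group")   # now 3rd of 4 columns, not last
```

Selecting predictors by name, e.g. setdiff(colnames(my_train_f), "status_group"), avoids this class of bug entirely.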

3.4.4 GBM with stratified k-folds

my_train_l <- copy(my_train2)
my_test_l <- copy(my_test2)

# caret does not accept the original factor levels as class names
levels(my_train_l$status_group) <- c('f','fnr','nf')
levels(my_test_l$status_group) <- c('f','fnr','nf')
tic()
set.seed(1)

n.folds <- 5
cvIndex <- createFolds(factor(my_train_l$status_group), n.folds, returnTrain = T)


trainControl <- trainControl(index = cvIndex,
                             method = "cv", 
                             number = n.folds,
                             classProbs = TRUE
                            )

gbmgrid<-expand.grid(shrinkage=c(0.01),
 n.minobsinnode=c(fit_r1$min.node.size),
 n.trees=c(fit_r1$num.trees),
 interaction.depth=c(2))



set.seed(1)
fit_g21 <- train( 
             status_group~., 
             data      = my_train_l, 
             method    = "gbm", 
             metric    = "Accuracy",
             bag.fraction=1,
             distribution="multinomial",
             trControl = trainControl,
             tuneGrid=gbmgrid,
             verbose = TRUE
            )
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0986             nan     0.0100       nan
##      2        1.0904             nan     0.0100       nan
##      3        1.0824             nan     0.0100       nan
##      4        1.0747             nan     0.0100       nan
##      5        1.0672             nan     0.0100       nan
##      6        1.0599             nan     0.0100       nan
##      7        1.0529             nan     0.0100       nan
##      8        1.0460             nan     0.0100       nan
##      9        1.0394             nan     0.0100       nan
##     10        1.0329             nan     0.0100       nan
##     20        0.9777             nan     0.0100       nan
##     40        0.9030             nan     0.0100       nan
##     60        0.8564             nan     0.0100       nan
##     80        0.8245             nan     0.0100       nan
##    100        0.8013             nan     0.0100       nan
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0986             nan     0.0100       nan
##      2        1.0904             nan     0.0100       nan
##      3        1.0824             nan     0.0100       nan
##      4        1.0747             nan     0.0100       nan
##      5        1.0672             nan     0.0100       nan
##      6        1.0599             nan     0.0100       nan
##      7        1.0529             nan     0.0100       nan
##      8        1.0460             nan     0.0100       nan
##      9        1.0394             nan     0.0100       nan
##     10        1.0329             nan     0.0100       nan
##     20        0.9776             nan     0.0100       nan
##     40        0.9027             nan     0.0100       nan
##     60        0.8556             nan     0.0100       nan
##     80        0.8236             nan     0.0100       nan
##    100        0.8005             nan     0.0100       nan
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 78: scheme_managementNone has no variation.
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0986             nan     0.0100       nan
##      2        1.0904             nan     0.0100       nan
##      3        1.0824             nan     0.0100       nan
##      4        1.0746             nan     0.0100       nan
##      5        1.0671             nan     0.0100       nan
##      6        1.0599             nan     0.0100       nan
##      7        1.0528             nan     0.0100       nan
##      8        1.0460             nan     0.0100       nan
##      9        1.0393             nan     0.0100       nan
##     10        1.0329             nan     0.0100       nan
##     20        0.9776             nan     0.0100       nan
##     40        0.9025             nan     0.0100       nan
##     60        0.8556             nan     0.0100       nan
##     80        0.8239             nan     0.0100       nan
##    100        0.8010             nan     0.0100       nan
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 48: region_code40 has no variation.
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0986             nan     0.0100       nan
##      2        1.0903             nan     0.0100       nan
##      3        1.0823             nan     0.0100       nan
##      4        1.0746             nan     0.0100       nan
##      5        1.0671             nan     0.0100       nan
##      6        1.0598             nan     0.0100       nan
##      7        1.0527             nan     0.0100       nan
##      8        1.0459             nan     0.0100       nan
##      9        1.0392             nan     0.0100       nan
##     10        1.0328             nan     0.0100       nan
##     20        0.9774             nan     0.0100       nan
##     40        0.9028             nan     0.0100       nan
##     60        0.8562             nan     0.0100       nan
##     80        0.8244             nan     0.0100       nan
##    100        0.8010             nan     0.0100       nan
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0986             nan     0.0100       nan
##      2        1.0904             nan     0.0100       nan
##      3        1.0824             nan     0.0100       nan
##      4        1.0747             nan     0.0100       nan
##      5        1.0673             nan     0.0100       nan
##      6        1.0600             nan     0.0100       nan
##      7        1.0530             nan     0.0100       nan
##      8        1.0462             nan     0.0100       nan
##      9        1.0396             nan     0.0100       nan
##     10        1.0332             nan     0.0100       nan
##     20        0.9781             nan     0.0100       nan
##     40        0.9036             nan     0.0100       nan
##     60        0.8572             nan     0.0100       nan
##     80        0.8253             nan     0.0100       nan
##    100        0.8029             nan     0.0100       nan
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0986             nan     0.0100       nan
##      2        1.0904             nan     0.0100       nan
##      3        1.0824             nan     0.0100       nan
##      4        1.0747             nan     0.0100       nan
##      5        1.0672             nan     0.0100       nan
##      6        1.0599             nan     0.0100       nan
##      7        1.0529             nan     0.0100       nan
##      8        1.0460             nan     0.0100       nan
##      9        1.0394             nan     0.0100       nan
##     10        1.0329             nan     0.0100       nan
##     20        0.9777             nan     0.0100       nan
##     40        0.9029             nan     0.0100       nan
##     60        0.8565             nan     0.0100       nan
##     80        0.8246             nan     0.0100       nan
##    100        0.8015             nan     0.0100       nan
toc()
## 1402.459 sec elapsed
validation.fit_g21 <- predict(fit_g21, newdata = my_test_l)
confusionMatrix( my_test_l$status_group, validation.fit_g21)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    f  fnr   nf
##        f   5901    0  550
##        fnr  733    0  130
##        nf  2381    0 2183
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6806         
##                  95% CI : (0.6721, 0.689)
##     No Information Rate : 0.759          
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.355          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
## 
## Statistics by Class:
## 
##                      Class: f Class: fnr Class: nf
## Sensitivity            0.6546         NA    0.7625
## Specificity            0.8079    0.92734    0.7359
## Pos Pred Value         0.9147         NA    0.4783
## Neg Pred Value         0.4262         NA    0.9070
## Prevalence             0.7590    0.00000    0.2410
## Detection Rate         0.4968    0.00000    0.1838
## Detection Prevalence   0.5431    0.07266    0.3842
## Balanced Accuracy      0.7312         NA    0.7492
varImp(fit_g21)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 434)
## 
##                                             Overall
## waterpoint_typeother                       100.0000
## quantityenough                              74.9895
## extraction_typeother                        58.6767
## fe_amount_tsh2                              23.0301
## waterpoint_typecommunal standpipe multiple  17.7459
## quantityinsufficient                        14.9790
## longitude                                   10.3591
## water_qualityunknown                         4.7173
## payment_typenever pay                        3.5205
## fe_lgaBariadi                                2.7543
## managementvwc                                2.6633
## extraction_typegravity                       2.4734
## regionMwanza                                 2.1422
## fe_lgaKigoma Rural                           1.7873
## sourcerainwater harvesting                   1.7284
## extraction_typenira/tanira                   1.4516
## fe_funderGovernment Of Tanzania              1.3614
## paymentunknown                               0.8093
## fe_lgaChunya                                 0.6117
## sourcemachine dbh                            0.6080

4 Iteration C

4.1 Categorical variables

Let's examine how balanced the categorical variables basin and region are by exploring their frequencies.

res_target <- round( prop.table(table(datIn$basin))*100, 2)
kable(res_target)
Basin                     Freq (%)
Internal                     13.11
Lake Nyasa                    8.56
Lake Rukwa                    4.13
Lake Tanganyika              10.83
Lake Victoria                17.25
Pangani                      15.05
Rufiji                       13.43
Ruvuma / Southern Coast       7.56
Wami / Ruvu                  10.08
res_target <- round( prop.table(table(datIn$region))*100, 2)
kable(res_target)
Region            Freq (%)
Arusha                5.64
Dar es Salaam         1.36
Dodoma                3.71
Iringa                8.91
Kagera                5.58
Kigoma                4.74
Kilimanjaro           7.37
Lindi                 2.60
Manyara               2.66
Mara                  3.31
Mbeya                 7.81
Morogoro              6.74
Mtwara                2.91
Mwanza                5.22
Pwani                 4.44
Rukwa                 3.04
Ruvuma                4.44
Shinyanga             8.39
Singida               3.52
Tabora                3.30
Tanga                 4.29

Distribution of basin. How many distinct basin and region values do we have in the dataset?

datIn[ , .(.N), by = .(basin)][order(-N)]
datIn[ , .(.N), by = .(region)][order(-N)]
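If only the number of distinct values is needed, data.table's uniqueN answers the question directly without building the full count table. A minimal sketch on a toy table standing in for datIn:

```r
library(data.table)

# Toy table standing in for datIn (hypothetical values)
dt <- data.table(basin = c("Pangani", "Rufiji", "Pangani", "Internal"))

# Number of distinct basins, without building the frequency table
dt[ , uniqueN(basin)]
```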

4.2 Frequencies

In this version we encode each category by its frequency instead of using one-hot encoding.

datMod3 = copy(datMod2)
predict_x3 <- copy(predict_x2)

# Convert the basin and region variables to frequency encodings
datMod3[ , fe_basin := .N , by = .(basin)]
datMod3[ , fe_region := .N , by = .(region)]
to_rem <- c('basin', 'region')
datMod3[ , (to_rem) := NULL]


predict_x3[ , fe_basin := .N , by = .(basin)]
predict_x3[ , fe_region := .N , by = .(region)]
predict_x3[ , (to_rem) := NULL]
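Note that computing .N separately on predict_x3 puts the test encodings on a different count scale, since the test set has fewer rows than the training set. A variant that avoids this, sketched here on toy tables standing in for datMod3 and predict_x3, derives the frequency table from the training set only and joins it onto the test set:

```r
library(data.table)

# Toy tables standing in for datMod3 / predict_x3 (hypothetical values)
train <- data.table(basin = c("Pangani", "Pangani", "Rufiji", "Rufiji", "Rufiji"))
test  <- data.table(basin = c("Rufiji", "Pangani"))

# Frequency table computed on the training set only
freq_map <- train[ , .(fe_basin = .N), by = basin]

# Join the train-derived frequencies onto the test set so both
# encodings live on the same scale
test <- freq_map[test, on = "basin"]
```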

4.3 Models

Same algorithms, hyperparameters, and resampling as in iteration B; only the transformations applied to the train and test sets change. The results are generally worse: validation accuracy drops in three of the four models compared with their iteration-B counterparts.

  • Ranger without resampling: 0.8147
  • Ranger with stratifiedKfolds: 0.7653
  • GBM without resampling: 1
  • GBM with stratifiedKfolds: 0.6769

The one improvement, of about three percentage points, is ranger with stratifiedKfolds. The accuracy of exactly 1 for GBM without resampling should not be read as a good result: a perfect score on held-out data is a strong sign that the target has leaked into the predictors.

# Split out validation dataset
# create a list of 80% of the rows in the original dataset we can use for training
set.seed(1)
validationIndex <- createDataPartition(datMod3$status_group, p=0.80, list=FALSE)

# select 20% of the data for validation
my_test3  <-datMod3[-validationIndex,]
# use the remaining 80% of the data for training
my_train3 <-datMod3[validationIndex,]


tic()
set.seed(1)
fit_r3 <- ranger(
  status_group ~. , 
  data = my_train3,
  num.trees = fit_r1$num.trees,
  importance = 'impurity',
  write.forest = TRUE,
  min.node.size = fit_r1$min.node.size,
  splitrule = "gini",
  verbose = TRUE,
  classification = TRUE
)

toc()
## 13.173 sec elapsed
fit_r3
## Ranger result
## 
## Call:
##  ranger(status_group ~ ., data = my_train3, num.trees = fit_r1$num.trees,      importance = "impurity", write.forest = TRUE, min.node.size = fit_r1$min.node.size,      splitrule = "gini", verbose = TRUE, classification = TRUE) 
## 
## Type:                             Classification 
## Number of trees:                  100 
## Sample size:                      47522 
## Number of independent variables:  38 
## Mtry:                             6 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             18.96 %
summary(fit_r3)
##                           Length Class         Mode     
## predictions               47522  factor        numeric  
## num.trees                     1  -none-        numeric  
## num.independent.variables     1  -none-        numeric  
## mtry                          1  -none-        numeric  
## min.node.size                 1  -none-        numeric  
## variable.importance          38  -none-        numeric  
## prediction.error              1  -none-        numeric  
## forest                       10  ranger.forest list     
## confusion.matrix              9  table         numeric  
## splitrule                     1  -none-        character
## treetype                      1  -none-        character
## call                         10  -none-        call     
## importance.mode               1  -none-        character
## num.samples                   1  -none-        numeric  
## replace                       1  -none-        logical
vars_imp3 <- fit_r3$variable.importance
vars_imp3 <- as.data.frame(vars_imp3)
vars_imp3$myvar <- rownames(vars_imp3)
# Sort by importance so that [1:10, ] really selects the top 10 variables
vars_imp3 <- vars_imp3[order(-vars_imp3$vars_imp3), ]


library(ggpubr) # already loaded above; kept here as a reminder.
ggbarplot(vars_imp3[1:10, ],
          x = "myvar", y = "vars_imp3",
          color = "blue",             # bar border color
          palette = "jco",            # jco journal color palette, see ?ggpar
          sort.val = "asc",           # sort values in ascending order
          sort.by.groups = FALSE,     # don't sort inside each group
          x.text.angle = 90,          # rotate x axis text vertically
          ylab = "Importance",
          xlab = 'Variable',
          rotate = TRUE,
          ggtheme = theme_minimal(),
          main = "ranger with datMod3"
)

validation.fit_r3 <- predict(fit_r3, data = my_test3)
confusionMatrix( my_test3$status_group, validation.fit_r3$predictions)
## Confusion Matrix and Statistics
## 
##                          Reference
## Prediction                functional functional needs repair non functional
##   functional                    5805                     164            482
##   functional needs repair        433                     304            126
##   non functional                 925                      71           3568
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8147          
##                  95% CI : (0.8076, 0.8217)
##     No Information Rate : 0.603           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6531          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: functional Class: functional needs repair
## Sensitivity                     0.8104                        0.56401
## Specificity                     0.8630                        0.95070
## Pos Pred Value                  0.8999                        0.35226
## Neg Pred Value                  0.7498                        0.97867
## Prevalence                      0.6030                        0.04538
## Detection Rate                  0.4887                        0.02559
## Detection Prevalence            0.5431                        0.07266
## Balanced Accuracy               0.8367                        0.75735
##                      Class: non functional
## Sensitivity                         0.8544
## Specificity                         0.8707
## Pos Pred Value                      0.7818
## Neg Pred Value                      0.9169
## Prevalence                          0.3516
## Detection Rate                      0.3004
## Detection Prevalence                0.3842
## Balanced Accuracy                   0.8625
prediction.fit_r3 <- predict(fit_r3, data = predict_x3)
file.prediction.fit_r3 <- cbind.data.frame(test_id, prediction.fit_r3$predictions)
colnames(file.prediction.fit_r3) <- c("id","status_group")
fwrite(file.prediction.fit_r3, file = "predi_fit_r3.csv", sep = ",")
my_train_l <- copy(my_train3)
my_test_l <- copy(my_test3)

# caret requires class levels that are valid R variable names (the original labels contain spaces)
levels(my_train_l$status_group) <- c('f','fnr','nf')
levels(my_test_l$status_group) <- c('f','fnr','nf')

n.folds <- 5
cvIndex <- createFolds(factor(my_train_l$status_group), n.folds, returnTrain = TRUE)

trainControl <- trainControl(index = cvIndex,
                             method = "cv", 
                             number = n.folds,
                             classProbs = TRUE
)

rf.grid<-expand.grid( mtry =  fit_r1$mtry,              
                      splitrule  = 'gini',
                      min.node.size = fit_r1$min.node.size
)
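The index argument built with createFolds is what makes the cross-validation stratified: each fold preserves the class proportions of status_group. A small sketch with a toy label vector (hypothetical 90/10 imbalance) showing that every training split keeps the original ratio:

```r
library(caret)

set.seed(1)
# Imbalanced toy target: 90 'f' versus 10 'nf'
y <- factor(rep(c("f", "nf"), times = c(90, 10)))

folds <- createFolds(y, k = 5, returnTrain = TRUE)

# Each training split keeps the original 90/10 class ratio
sapply(folds, function(idx) mean(y[idx] == "f"))
```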

tic()
set.seed(1)
fit_r31 <- train( 
  status_group ~., 
  data      = my_train_l, 
  method    = "ranger", 
  metric    = "Accuracy",
  importance = 'impurity',
  tuneGrid = rf.grid,
  num.trees = fit_r1$num.trees,
  trControl = trainControl,
  verbose = TRUE
)

toc()
## 134.237 sec elapsed
validation.fit_r31 <- predict(fit_r31, newdata = my_test_l)
confusionMatrix( my_test_l$status_group, validation.fit_r31)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    f  fnr   nf
##        f   6010   19  422
##        fnr  644   76  143
##        nf  1543   17 3004
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7653          
##                  95% CI : (0.7576, 0.7729)
##     No Information Rate : 0.6901          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5389          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: f Class: fnr Class: nf
## Sensitivity            0.7332   0.678571    0.8417
## Specificity            0.8802   0.933112    0.8123
## Pos Pred Value         0.9316   0.088065    0.6582
## Neg Pred Value         0.5970   0.996732    0.9228
## Prevalence             0.6901   0.009429    0.3005
## Detection Rate         0.5060   0.006398    0.2529
## Detection Prevalence   0.5431   0.072655    0.3842
## Balanced Accuracy      0.8067   0.805842    0.8270
varImp(fit_r31)
## ranger variable importance
## 
##   only 20 most important variables shown (out of 408)
## 
##                                   Overall
## quantityenough                     100.00
## quantity_groupenough                92.61
## extraction_type_groupother          89.40
## extraction_typeother                87.50
## waterpoint_type_groupother          86.69
## longitude                           86.10
## latitude                            84.50
## extraction_type_classother          75.65
## gps_height                          73.17
## waterpoint_typeother                63.85
## fe_amount_tsh2                      60.78
## fe_region                           52.64
## fe_population2                      45.94
## fe_basin                            44.45
## payment_typenever pay               42.32
## waterpoint_typecommunal standpipe   40.42
## waterpoint_typehand pump            36.96
## extraction_typegravity              36.17
## extraction_type_classhandpump       33.37
## quantity_groupinsufficient          32.77

4.3.1 GBM without resampling

tic()
set.seed(1)
my_train3 %>% mutate_if(is.character, as.factor) -> my_train_f
my_test3 %>% mutate_if(is.character, as.factor) -> my_test_f

fit_g3 <- gbm.fit(my_train_f[ ,c(1:38)], my_train_f$status_group,
   distribution = "multinomial",
   n.trees = fit_r1$num.trees,
   interaction.depth = 2,
   n.minobsinnode = fit_r1$min.node.size,
   shrinkage = 0.01,
   bag.fraction = 1
)
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0986             nan     0.0100       nan
##      2        1.0688             nan     0.0100       nan
##      3        1.0402             nan     0.0100       nan
##      4        1.0125             nan     0.0100       nan
##      5        0.9859             nan     0.0100       nan
##      6        0.9601             nan     0.0100       nan
##      7        0.9353             nan     0.0100       nan
##      8        0.9113             nan     0.0100       nan
##      9        0.8880             nan     0.0100       nan
##     10        0.8656             nan     0.0100       nan
##     20        0.6758             nan     0.0100       nan
##     40        0.4255             nan     0.0100       nan
##     60        0.2746             nan     0.0100       nan
##     80        0.1797             nan     0.0100       nan
##    100        0.1186             nan     0.0100       nan
toc()
## 32.214 sec elapsed
fit_g3
## A gradient boosted model with multinomial loss function.
## 100 iterations were performed.
## There were 38 predictors of which 11 had non-zero influence.
print(fit_g3)
## A gradient boosted model with multinomial loss function.
## 100 iterations were performed.
## There were 38 predictors of which 11 had non-zero influence.
summary(fit_g3)
varimp_g3 <- kable(relative.influence(fit_g3, n.trees = 100)  )
validation.fit_g3 = predict.gbm(fit_g3,n.trees=100, newdata=my_test_f[ ,c(1:38)],type='response')
p.validation.fit_g3 <- apply(validation.fit_g3, 1, which.max)
validation.fit_g3 <- as.factor( colnames(validation.fit_g3)[p.validation.fit_g3])
confusionMatrix( validation.fit_g3,my_test_f$status_group)
## Confusion Matrix and Statistics
## 
##                          Reference
## Prediction                functional functional needs repair non functional
##   functional                    6451                       0              0
##   functional needs repair          0                     863              0
##   non functional                   0                       0           4564
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.5431     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: functional Class: functional needs repair
## Sensitivity                     1.0000                        1.00000
## Specificity                     1.0000                        1.00000
## Pos Pred Value                  1.0000                        1.00000
## Neg Pred Value                  1.0000                        1.00000
## Prevalence                      0.5431                        0.07266
## Detection Rate                  0.5431                        0.07266
## Detection Prevalence            0.5431                        0.07266
## Balanced Accuracy               1.0000                        1.00000
##                      Class: non functional
## Sensitivity                         1.0000
## Specificity                         1.0000
## Pos Pred Value                      1.0000
## Neg Pred Value                      1.0000
## Prevalence                          0.3842
## Detection Rate                      0.3842
## Detection Prevalence                0.3842
## Balanced Accuracy                   1.0000
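With a multinomial loss, predict.gbm returns class probabilities rather than labels, which is why the predictions above are decoded with apply and which.max. A minimal sketch of that decoding step on a hand-made probability matrix (the trees dimension already dropped; all values hypothetical):

```r
# Hypothetical per-class probabilities for two observations
probs <- matrix(c(0.7, 0.1,
                  0.2, 0.3,
                  0.1, 0.6),
                nrow = 2,
                dimnames = list(NULL,
                                c("functional",
                                  "functional needs repair",
                                  "non functional")))

# Row-wise index of the most probable class, mapped back to its label
idx    <- apply(probs, 1, which.max)
labels <- factor(colnames(probs)[idx])
```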

4.3.2 GBM with stratifiedKfolds

my_train_l <- copy(my_train3)
my_test_l <- copy(my_test3)

# caret requires class levels that are valid R variable names (the original labels contain spaces)
levels(my_train_l$status_group) <- c('f','fnr','nf')
levels(my_test_l$status_group) <- c('f','fnr','nf')
tic()
set.seed(1)

n.folds <- 5
cvIndex <- createFolds(factor(my_train_l$status_group), n.folds, returnTrain = TRUE)


trainControl <- trainControl(index = cvIndex,
                             method = "cv", 
                             number = n.folds,
                             classProbs = TRUE
)

gbmgrid <- expand.grid(shrinkage = c(0.01),
                       n.minobsinnode = c(fit_r1$min.node.size),
                       n.trees = c(fit_r1$num.trees),
                       interaction.depth = c(2))



set.seed(1)
fit_g31 <- train( 
             status_group~., 
             data      = my_train_l, 
             method    = "gbm", 
             metric    = "Accuracy",
             bag.fraction=1,
             distribution="multinomial",
             trControl = trainControl,
             tuneGrid=gbmgrid,
             verbose = TRUE
            )
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0986             nan     0.0100       nan
##      2        1.0904             nan     0.0100       nan
##      3        1.0824             nan     0.0100       nan
##      4        1.0747             nan     0.0100       nan
##      5        1.0672             nan     0.0100       nan
##      6        1.0600             nan     0.0100       nan
##      7        1.0530             nan     0.0100       nan
##      8        1.0462             nan     0.0100       nan
##      9        1.0395             nan     0.0100       nan
##     10        1.0331             nan     0.0100       nan
##     20        0.9781             nan     0.0100       nan
##     40        0.9036             nan     0.0100       nan
##     60        0.8572             nan     0.0100       nan
##     80        0.8251             nan     0.0100       nan
##    100        0.8022             nan     0.0100       nan
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0986             nan     0.0100       nan
##      2        1.0904             nan     0.0100       nan
##      3        1.0824             nan     0.0100       nan
##      4        1.0747             nan     0.0100       nan
##      5        1.0672             nan     0.0100       nan
##      6        1.0599             nan     0.0100       nan
##      7        1.0529             nan     0.0100       nan
##      8        1.0460             nan     0.0100       nan
##      9        1.0394             nan     0.0100       nan
##     10        1.0330             nan     0.0100       nan
##     20        0.9778             nan     0.0100       nan
##     40        0.9032             nan     0.0100       nan
##     60        0.8566             nan     0.0100       nan
##     80        0.8247             nan     0.0100       nan
##    100        0.8024             nan     0.0100       nan
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0986             nan     0.0100       nan
##      2        1.0904             nan     0.0100       nan
##      3        1.0825             nan     0.0100       nan
##      4        1.0748             nan     0.0100       nan
##      5        1.0674             nan     0.0100       nan
##      6        1.0602             nan     0.0100       nan
##      7        1.0532             nan     0.0100       nan
##      8        1.0464             nan     0.0100       nan
##      9        1.0398             nan     0.0100       nan
##     10        1.0334             nan     0.0100       nan
##     20        0.9785             nan     0.0100       nan
##     40        0.9039             nan     0.0100       nan
##     60        0.8571             nan     0.0100       nan
##     80        0.8249             nan     0.0100       nan
##    100        0.8024             nan     0.0100       nan
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 20: region_code40 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 50: scheme_managementNone has no variation.
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0986             nan     0.0100       nan
##      2        1.0904             nan     0.0100       nan
##      3        1.0825             nan     0.0100       nan
##      4        1.0749             nan     0.0100       nan
##      5        1.0675             nan     0.0100       nan
##      6        1.0603             nan     0.0100       nan
##      7        1.0533             nan     0.0100       nan
##      8        1.0465             nan     0.0100       nan
##      9        1.0400             nan     0.0100       nan
##     10        1.0336             nan     0.0100       nan
##     20        0.9789             nan     0.0100       nan
##     40        0.9046             nan     0.0100       nan
##     60        0.8578             nan     0.0100       nan
##     80        0.8259             nan     0.0100       nan
##    100        0.8033             nan     0.0100       nan
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 70: extraction_typeother - mkulima/shinyanga has no
## variation.
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0986             nan     0.0100       nan
##      2        1.0905             nan     0.0100       nan
##      3        1.0826             nan     0.0100       nan
##      4        1.0749             nan     0.0100       nan
##      5        1.0675             nan     0.0100       nan
##      6        1.0604             nan     0.0100       nan
##      7        1.0534             nan     0.0100       nan
##      8        1.0466             nan     0.0100       nan
##      9        1.0401             nan     0.0100       nan
##     10        1.0337             nan     0.0100       nan
##     20        0.9790             nan     0.0100       nan
##     40        0.9046             nan     0.0100       nan
##     60        0.8581             nan     0.0100       nan
##     80        0.8257             nan     0.0100       nan
##    100        0.8029             nan     0.0100       nan
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0986             nan     0.0100       nan
##      2        1.0904             nan     0.0100       nan
##      3        1.0825             nan     0.0100       nan
##      4        1.0748             nan     0.0100       nan
##      5        1.0674             nan     0.0100       nan
##      6        1.0601             nan     0.0100       nan
##      7        1.0531             nan     0.0100       nan
##      8        1.0464             nan     0.0100       nan
##      9        1.0398             nan     0.0100       nan
##     10        1.0334             nan     0.0100       nan
##     20        0.9785             nan     0.0100       nan
##     40        0.9041             nan     0.0100       nan
##     60        0.8574             nan     0.0100       nan
##     80        0.8252             nan     0.0100       nan
##    100        0.8026             nan     0.0100       nan
toc()
## 1328.609 sec elapsed
# Hold-out evaluation. Note: caret's signature is
# confusionMatrix(data = predictions, reference = truth); the arguments below
# are passed in the opposite order, so in the table that follows the
# "Prediction" axis actually holds the true labels and "Reference" the
# model's predictions.
validation.fit_g31 <- predict(fit_g31, newdata = my_test_l)
confusionMatrix( my_test_l$status_group, validation.fit_g31)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    f  fnr   nf
##        f   5842    2  607
##        fnr  738    1  124
##        nf  2362    5 2197
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6769          
##                  95% CI : (0.6684, 0.6853)
##     No Information Rate : 0.7528          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.349           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: f Class: fnr Class: nf
## Sensitivity            0.6533  1.250e-01    0.7503
## Specificity            0.7926  9.274e-01    0.7355
## Pos Pred Value         0.9056  1.159e-03    0.4814
## Neg Pred Value         0.4288  9.994e-01    0.9001
## Prevalence             0.7528  6.735e-04    0.2465
## Detection Rate         0.4918  8.419e-05    0.1850
## Detection Prevalence   0.5431  7.266e-02    0.3842
## Balanced Accuracy      0.7229  5.262e-01    0.7429
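The confusion matrix shows that the intermediate class (`fnr`, "functional needs repair") is almost never predicted (sensitivity 0.125, detection rate near zero), a symptom of the strong class imbalance in `status_group`. One way to mitigate this with caret is to resample inside each fold via the `sampling` argument of `trainControl`. A minimal sketch, assuming the training objects used above (`my_train_l`, `status_group`):

```r
library(caret)

# Down-sample the majority classes within each CV fold so the three
# status_group classes are balanced while the model is being fit.
ctrl_bal <- trainControl(method = "cv", number = 5, sampling = "down")

fit_bal <- train(
  status_group ~ ., data = my_train_l,
  method    = "gbm",
  trControl = ctrl_bal,
  verbose   = FALSE
)
```

`sampling` also accepts `"up"`, `"rose"` and `"smote"`, each with different trade-offs between recall on the minority class and overall accuracy.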
varImp(fit_g31)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 408)
## 
##                                             Overall
## waterpoint_typeother                       100.0000
## quantityenough                              76.3676
## extraction_typeother                        56.0496
## fe_amount_tsh2                              22.4840
## waterpoint_typecommunal standpipe multiple  20.9026
## quantityinsufficient                        15.3666
## longitude                                    9.6876
## payment_typenever pay                        5.4194
## water_qualityunknown                         4.3711
## fe_region                                    4.1316
## fe_lgaBariadi                                3.2110
## fe_lgaKigoma Rural                           2.7922
## managementvwc                                2.1924
## fe_funderGovernment Of Tanzania              2.0766
## extraction_typegravity                       1.8127
## region_code19                                1.8001
## extraction_typenira/tanira                   0.9558
## sourcerainwater harvesting                   0.6026
## fe_basin                                     0.5972
## fe_public_meetingDesconocido                 0.3309
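caret provides a `plot` method for `varImp` objects, so the 20 importances listed above can be visualized directly (a sketch using the fitted object from this section):

```r
# Dot chart of the 20 most important predictors of the gbm model.
plot(varImp(fit_g31), top = 20)
```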

5 Conclusions

  1. The best result was obtained with a Random Forest model using the ranger library. The best iteration was option B, which included the numeric variables and applied lumping to the categorical variables with many levels. This option reached a score of 0.8174 on the competition platform (see the attached image).

  2. The next stage should be tuning the model by exploring changes to the hyperparameters of the best model; due to computational and time constraints, this second stage was not completed.

  3. Additional transformations of the latitude, longitude and aquifer elevation fields could be explored in the future; due to time limitations, this was not carried out in the present work.
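The tuning stage mentioned in conclusion 2 could be sketched with caret's `tuneGrid` interface for ranger. The grid values below are illustrative assumptions, and `my_train_l`/`status_group` are the objects used in the iterations above:

```r
library(caret)
library(ranger)

# Hypothetical hyperparameter grid: mtry, splitrule and min.node.size
# are ranger's tunable parameters under caret's "ranger" method.
grid <- expand.grid(
  mtry          = c(5, 10, 20),
  splitrule     = "gini",
  min.node.size = c(1, 5, 10)
)

ctrl <- trainControl(method = "cv", number = 5)

fit_rf_tuned <- train(
  status_group ~ ., data = my_train_l,
  method    = "ranger",
  trControl = ctrl,
  tuneGrid  = grid,
  num.trees = 500
)

fit_rf_tuned$bestTune  # best hyperparameter combination found
```

Cross-validated accuracy for each grid cell is available in `fit_rf_tuned$results`, which makes it easy to check whether the 0.8174 score can be improved before submitting again.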

Result on the platform. Note that two submissions were made, due to small corrections in the code; results A2, B2 and C2 are the ones used in the conclusions.